Reconstruction of Markov Random Fields from Samples: 
Some Observations and Algorithms 

Guy Bresler* Elchanan Mossel' Allan Sly* 
March 9, 2010 

o 

(N 

H , 

Abstract 

Markov random fields are used to model high dimensional distributions in a number 
of applied areas. Much recent interest has been devoted to the reconstruction of the 
dependency structure from independent samples from the Markov random fields. We 
analyze a simple algorithm for reconstructing the underlying graph defining a Markov 
random field on n nodes and maximum degree d given observations. We show that 
under mild non-degeneracy conditions it reconstructs the generating graph with high 
probability using 0(de~ 2 S~ 4 \ogn) samples where e,8 depend on the local interactions. 
For most local interaction e,5 are of order exp(— 0(d)). 

Our results are optimal as a function of n up to a multiplicative constant depending 
on d and the strength of the local interactions. Our results seem to be the first results 
for general models that guarantee that the generating model is reconstructed. Further- 
more, we provide explicit 0(ra d+2 e~ 2 (5~ 4 log n) running time bound. In cases where the 
, measure on the graph has correlation decay, the running time is 0(n 2 log n) for all fixed 

d. We also discuss the effect of observing noisy samples and show that as long as the 
^ noise level is low, our algorithm is effective. On the other hand, we construct an exam- 

ple where large noise implies non-identifiability even for generic noise and interactions. 
C*** , Finally, we briefly show that in some simple cases, models with hidden nodes can also 

be recovered. 
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In this paper we consider the problem of reconstructing the graph structure of a Markov 
random held from independent and identically distributed samples. Markov random helds 
(MRF) provide a very general framework for dehning high dimensional distributions and 
the reconstruction of the MRF from observations has attracted much recent interest, in 
particular in biology, see e.g. [9] and a list of related references [10] . 
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1.1 Our Results 



We give sharp, up to a multiplicative constant, estimates for the number of independent 
samples needed to infer the underlying graph of a Markov random field of bounded degree. 
In Theorem Q] we use a simple information-theoretic argument to show that fl(dlogn) 
samples are required to reconstruct a randomly selected graph on n vertices with maximum 
degree at most d. Then in Theorems [2] and [3] we propose two algorithms for reconstruction 
that use only 0(de~ 2 5~ 4 log n) where e and 5 are lower bounds on marginal distributions in 
the neighbourhood of a vertex. Under mild non-degeneracy conditions e, 5 = exp(— 0(d)) 
and for some models e, 5 = poly _1 d. Examples of the later model include the hardcore 
model with fugacity A = ©(^)- Our main focus is on the reconstruction of sparse MRFs 
case where d is fixed in which case e and 5 are constant. The two theorems differ in their 
running time and the required non-degeneracy conditions. It is clear that non-degeneracy 
conditions are needed to insure that there is a unique graph associated with the observed 
probability distribution. 

In addition to the fully-observed setting in which samples of all variables are available, 
we extend our algorithm in several directions. In Section [5] we consider the problem of noisy 
observations. In subsection 15.11 we show by way of an example that if some of the random 
variables are perturbed by noise then it is in general impossible to reconstruct the graph 
structure with probability approaching 1. Conversely, when the noise is relatively weak as 
compared to the coupling strengths between random variables, we show that the algorithms 
used in Theorems [2] and [3] reconstruct the graph with high probability. Furthermore, we 
study the problem of reconstruction with partial observations, i.e. samples from only a 
subset of the nodes are available. In Theorem [5] we provide sufficient conditions on the 
probability distribution for correct reconstruction. 

Chickering [2] showed that maximum-likelihood estimation of the underlying graph of 
a Markov random field is NP-complete. This does not contradict our results which assume 
that the data is generated from a model (or a model with a small amount of noise) . Although 
the algorithm we propose runs in time polynomial in the size of the graph, the dependence 
on degree (the run-time is 0(n d+2 e~ 2 5~ 4: logn)) may impose too high a computational 
cost for some applications. Indeed, for some Markov random fields exhibiting a decay of 
correlation a vast improvement can be realized: a modified version of the algorithm runs in 
time 0(dn 2 e~ 2 5~ 4: log n). This is proven in Theorem[4l 

1.2 Related Work 

Chow and Liu [1] considered the problem of estimating Markov random fields whose un- 
derlying graphs are trees, and provided an efficient (polynomial-time) algorithm based on 
the fact that in the tree case maximum-likelihood estimation amounts to the computation 
of a maximum-weight spanning tree with edge weights equal to pairwise empirical mutual 
information. Unfortunately, their approach does not generalize to the estimation of Markov 
random fields whose graphs have cycles. Much work in mathematical biology is devoted 
to reconstructing tree Markov fields when there are hidden models. For trees, given data 
that is generated from the model, the tree can be reconstructed efficiently from samples at 
a subset of the nodes given mild non-degeneracy conditions. See [12 ^1131 [TT] for some of the 
most recent and tightest results in this setup. 

The most closely related works are [3] and [5]. These can be compared in terms of 
sampling complexity, running time as well as the generality of the models to which they 



2 



apply. These are summarized in the Table below. The first line refers to the type of models 
that the method cover: Does the model allow clique interactions of just edge interactions? 
The next two lines refer to requirements on the strength of interactions: are they not 
required to be too weak / are only edges with strong interactions returned? are they not 
required to be too strong? The next line refers to the hardness of verifying if a given 
model satisfies the conditions of the algorithm (where X denoted that the verification is 
exponential in the size of the model). The following line refers to the following question: 
is there a guarantee that the generating model is returned with high probability. The final 
two lines refers to computational and sampling complexity where Cd denotes constants that 
depend on d. 



Method 


AKN [3J 


WRL [5J 


Alg 


High Temp Alg 


Cliques 


V 


X 




V 


No Int. Low. Bd. 


V 


X 


X 


X 


No Int. Upp. Bd. 


V 


X 


V 


X 


Verifiable Conds. 


V 


X 


V 


V 


Output Gen. Model 


X 


V 


V 


V 


Comp. Compl. 


n O(d) 


n 5 


n u(d) 


Cdfi 1 log n 


Sampl. Compl. 


n O(d) 


poly(d) logn 


c d log n 


Cd log n 



Abbeel, et al [3] considered the problem of reconstructing graphical models based on 
factor graphs, and proposed a polynomial time and sample complexity algorithm. However, 
the goal of their algorithm was not to reconstruct the true structure, but rather to produce a 
model whose distribution is close in Kullback-Leibler divergence to the true distribution. In 
applications it is often of interest to reconstruct the true structure which give some insights 
into the underlying structure of the inferred model. 

Note furthermore that two networks that differ only in the neighborhood of one node 
will have 0(1) KL distance. Therefore, even in cases where it is promised that the KL 
distance between the generating distribution and any other distribution defined by another 
graph is as large as possible, the lower bounds on the KL distance is 0(1) . Plugging this 
into the bounds in [3] yields a polynomial sampling complexity in the size of the network 
in order to find the generating network compared to our logarithmic sampling complexity. 
For other work based on minimizing the KL divergence see the references in pj. 

The same problem as in the present work (but restricted to the Ising model) was studied 
by Wainwright, et al [5], where an algorithm based on £i-regularization was introduced. 
The algorithm presented is efficient also for dense graphs with running time 0(n 5 ) but is 
applicable only in very restricted settings. The work only applies to the Ising model and 
more importantly only models with edge interactions (no larger cliques are allowed). The 
most important restrictions are the two conditions in the paper (Al and A2). Condition Al 
requires (among other things) that the "covariates [spins] do not become overly dependent" . 
Verifying when the conditions holds seems hard. However, it is easy to see that this condition 
fails for standard models such as the Ising model on the lattice or on random d-regular 
graphs when the model is at low temperatures, i.e. for f3 > | log(l + y/2) in the case of the 
two dimensional Ising model and /3 > tanh -1 (l/(d — 1)) for random d-regular graphs. 

Subsequent to our work being posted on the Arxiv, Santhanam and Wainwright [3] 
again considered essentially the problem for the Ising model, producing nearly matching 
lower and upper bounds on the asymptotic sampling complexity. Again their conditions do 
not apply to the low temperature regime. Another key difference from our work is that they 
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restrict attention to the Ising model, i.e. Markov random fields with pairwise potentials and 
where each variable takes two values. Our results are not limited to pairwise interactions 
and apply to the more general setting of MRFs with potentials on larger cliques. 



2 Preliminaries 

We begin with the definition of Markov random field. 

Definition 1. On a graph G = (V,E), a Markov random field is a distribution X taking 
values in A v , for some finite set A with \A\ = A, which satisfies the Markov property 

P(X(W),X(U)\X(S)) = P(X(W)\X(S))P(X(U)\X(S)) (1) 

when W, U and S are disjoint subsets of V such that every path in G from W to U passes 
through S and where X(U) denotes the restriction of X from A v to A u for U C V . 

Famously, by the Hammersley-Clifford Theorem, such distributions can be written in a 
factorized form as 



P(a) = 7F ex P 



(2) 



where Z is a normalizing constant, a ranges over the cliques in G, and Vl/ a : A^ — > MU{— oo} 
are functions called potentials. 

The problem we consider is that of reconstructing the graph G, given k independent 
samples X_ = {X 1 , . . . , X h } from the model. Denote by Gd the set of labeled graphs with 
maximum degree at most d. We assume that the graph G € Gd is from this class. A 
structure estimator (or reconstruction algorithm) G : A kn — > Gd is a ma P from the space 
of possible sample sequences to the set of graphs under consideration. We are interested in 
the asymptotic relationship between the number of nodes in the graph, n, the maximum 
degree d, and the number of samples k that are required. An algorithm using number of 
samples k(n) is deemed successful if in the limit of large n the probability of reconstruction 
error approaches zero. 



3 Lower Bound on Sample Complexity 

Suppose G is selected uniformly at random from The following theorem gives a lower 
bound of Q(dlogn) on the number of samples necessary to reconstruct the graph G. The 
argument is information theoretic, and follows by comparing the number of possible graphs 
with the amount of information available from the samples. 

Theorem 1. Let the graph G be drawn according to the uniform distribution on Then 
there exists a constant c = c(A) > such that if k < cdlogn then for any estimator 
G : X_ — >• Gd, the probability of correct reconstruction is P{G = G) = o(l). 

Remark 1. Note that the theorem above doesn't need to assume anything about the poten- 
tials. The theorem applies for any potentials that are consistent with the generating graph. 
In particular, it is valid both in cases where the graph is "identifiable" given many samples 
and in cases where it isn't. 
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Proof. To begin, we note that the probability of error is minimized by letting G be the 
maximum a posteriori (MAP) decision rule, 

Gmap(AI) = argmax 9GG P[G = g\X]. 

By the optimality of the MAP rule, this bounds the probability of error using any estimator. 
Now, the MAP estimator GmapQQ is a deterministic function of X_. Clearly, if a graph g 
is not in the range of G then the algorithm always makes an error when G = g. Let S be 
the set of graphs in the range of Gmap> so P (error) g G S c ) = 1. We have 

P(error) = ^P(error|G = g)P(G = g) 

= ^P(error|G = g)P(G = g) + ^ P(error|G = g)P(G = g) 

ges ges- 

>Y,P(G = g) = l-^\- 1 

g€S c g€S 

where the last step follows from the fact that \S\ < |X| < A nk . It remains only to express 
the number of graphs with max degree at most d, \Qd\, in terms of the parameters n, d. The 
following lemma gives an adequate bound. 

Lemma 1. Suppose d < n a with a < 1. Then the number of graphs with max degree at 
most d, \Qd\, satisfies 

log 1^1 = n(ndlogn). (4) 



Proof. To make the dependence on n explicit, let U n d be the number of graphs with n 
vertices with maximum degree at most d. We first bound U n+ 2,d in terms of U n> d,- Given a 
graph G with n vertices and degree at most d, add two vertices a and b. Select d distinct 
neighbors v±, . . . , Vd for vertex a, with d labeled edges; there are i^)d\ ways to do this. If Vi 
already has degree d in G, then has at least one neighbor u that is not a neighbor of a, 
since there are only d — 1 other neighbors of a. Remove the edge (vi,u) and place an edge 
labeled i from vertex b to u. This is done for each vertex v±, . . . , Vd, so b has degree at most 
d. The graph G can be reconstructed from the resulting labeled graph on n + 2 vertices as 
follows: remove vertex a, and return the neighbors of b to their correct original neighbors 
(this is possible because the edges are labeled). 

Removing the labels on the edges from a and b sends at most dl 2 edge-labeled graphs 
of this type on n + 2 vertices to the same unlabeled graph. Hence, the number of graphs 
with max degree d on n + 2 vertices is lower bounded as 

It follows that for n even (and greater than 2d + 4) 
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If n is odd, it suffices to note that U n+ i t d > U n< d- Taking the logarithm of equation ([5]) 
yields 

log U n ^d = Q(nd(logn — log cf)) = Q(ndlogn), (6) 
assuming that d < n a with a < 1. □ 

Together with equation ([3]) , Lemma Q] implies that for small enough c, if the number of 
samples k < cdlogn, then 

P(error) > 1 - — — = 1 - o(l). 

\y\ 

This completes the proof of Theorem [TJ □ 

4 Reconstruction 

We now turn to the problem of reconstructing the graph structure of a Markov random 
field from samples. For a vertex v we let N(v) = {u £ V — {v} : (u, v) G E} denote the set 
of neighbors of v. Determining the neighbors of v for every vertex in the graph is sufficient 
to determine all the edges of the graph and hence reconstruct the graph. We test each 
candidate neighborhood of size at most d by using the Markov property, which states that 
for each w G V - (N(v) U {v}) 

P(X(v)\X(N(v)), X(w)) = P(X(v)\X(N(v))) . (7) 

We give two theorems for reconstructing networks; they differ in their non-degeneracy 
conditions and their running time. The first one, immediately below, has more stringent 
non-degeneracy conditions and faster running time. 



4.1 Conditional Two Point Correlation Reconstruction 

Theorem 2. Suppose the graphical model satisfies the following: there exist e, 5 > such 
that for all v G V , if U C V — {v} with \U\ < d and N(v) & U then there exist values 
x v ,x w ,x' w ,x Ul , . . . , x Ul such that for some w G V — (U U {v}) 

\P(X(v) = x v \X(U) = Xu , X(w) = x w ) 

- P(X(v) = x v \X(U) = Xu , X(w) = x' w )\ > e 

and 

\P(X(U) = xu,X(w)=x w )\>5, 
\P(X(u 1 )=x u ,X(w)=x'j\ >S. 

Then with the constant C = ( ^^i^J + Cij , when k > Ccilogn, there exists an estimator 

G(X) such that the probability of correct reconstruction is P(G = G(X)) = 1 — 0(n~ Cl ). 
The estimator G is efficiently computable in 0(n d+2 \ogn) operations. 

Remark 2. Condition ([8j) captures the notion that each edge should have sufficient strength. 
Condition ([9]) is required so that we can accurately calculate the empirical conditional 
probabilities. 
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Proof. Let P denote the empirical probability measure from the k samples. Azuma's in- 
equality gives that if Y ~ Hm(k,p) then 

P(\Y - kp\ > 7 fc) < 2exp(-2 7 2 /c) 
and so for any collection U = {u\, . . . , u{\ C V and xi, . . . ,xi £ A we have 

P (jP(X(U) = x v ) - P{X(U) = x v )\ < 7) < 2exp(-2 7 2 A:). (10) 



There are ^4'(") < A l n l such choices of u\,...,ui and x±,...,xi. An application of the 
union bound implies that with probability at least 1 — A l n l 2 exp(— 2 7 2 fc) it holds that 



P(X(U)=x u )-P(X(U)=x U/ 



<7 



(11) 



for all {ui}\ =l and {xi}\ =l . If we additionally have I < d + 2 and k > C( 7 )<i log n, then 
equation CUD holds with probability at least 1 - A d+2 n d+2 2/n 2 ^ c{ ^ d . Choosing C( 7 ) = 
4^ + Ci, equation (HI]) holds with probability at least 1 - 2A d+2 /n Cl . 
For the remainder of the proof assume (jlip holds. Taking 



1 (e,5) = e5 2 /9, 

we can bound the error in conditional probabilities as 

\P(X{v) = x v \X(U) = x v ) - P{X(v) = x v \X(U) = Xu )\ 
P(X(v)=x v ,X(U)= Xu ) P(X{v)=x v ,X(U)= Xu ) 



(12) 



< 



P(X(U) = x v ) P(X{U) = x v ) 

P{X{v) = x v ,X(U) = x v ) P(X(v) = x v ,X(U) = x v ) 



+ 



P(X(U) = xv) 



1 



P(X(U) = xv) 



P(X(U) = xv) P(X(U) = xv) 

eS 2 



e5 2 
< + 



5 (5 - j)6 ~ 95 9(5 



^)5 



e5 

— + 



e e 
(9 - e5) < 4 



(13) 



For each vertex v £ V we consider all candidate neighborhoods for v, subsets U C V— {v} 
with \U\ < d. The estimate (|13p and the triangle inequality imply that if N(v) C U then 
by the Markov property, 



\P{X(v) = x v \X(U) = xu,X(w) = x w ) 

- P{X{v) = x v \X{U) = xu,X(w) = x'J\ < e/2 

for all w G V and x\, . . . , xi, x w , x' w , x v £ A such that 

P(X(U) = xu,X(w) = x w )\ > 5/2, 

P(X(U)=x u ,X(w)=x'J >5/2. 



(14) 



(15) 



Conversely by conditions ([8|) and (|9]) and the estimate (JT3J) , we have that for any U with 
N(v) U there exists some w £ V and x Ul , . . . , x Ul ,x w ,x' w ,x v £ A such that equation (fT5]) 
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holds but equation does not hold. Thus, choosing the smallest set U such that (H 
holds gives the correct neighborhood. 

To summarize, with number of samples 

, /81(d + 2) 

the algorithm correctly determines the graph G with probability 

P(G(X) = G) > 1 - 2A d+2 /n Cl . 

The analysis of the running time is straightforward. There are n nodes, and for each 
node we consider 0(n d ) neighborhoods. For each candidate neighborhood, we check ap- 
proximately 0(n) nodes and perform a correlation test of complexity O(logn). 

□ 

4.2 General Reconstruction 

While Theorem [2] applies to a wide range of models, condition ((SJ) may occasionally be too 
restrictive. One setting in which condition ((SJ) does not apply is if the marginal spin at some 
vertex v is independent of the marginal spins at all its neighbors, (i.e for all u £ N(v) and 
all x, y £ A we have P(X(v) = x,X(u) = y) = P(X(v) = x)P{X(u) = y). In this case the 
algorithm would incorrectly return the empty set for the neighborhood of v. The weaker 
conditions for Theorem [3] hold on essentially all Markov random fields. In particular, condi- 
tion (|16p says that the potentials are non-degenerate, which is clearly a necessary condition 
in order to recover the graph. Condition (|17j) holds for many models, for example all models 
with soft constraints. This additional generality comes at a computational cost, with the 
algorithm for Theorem 2 having a faster running time, 0(n d+2 logn) versus 0(n 2d+1 logn). 

Theorem 3. For an assignment xjj — (xuu ■ ■ • > x Ul ) and x' Ui (z A, define 

x ll( x Ui ) = ( x ux i ■ ■ ■ j x u . , . . . , x U[ ) 

to be the assignment obtained from x\j by replacing the ith element by x' u .. Suppose there 
exist e,S > such that the following condition holds: for all v £ V , if N(v) = u\, ... ,ui, 
then for each i, 1 < i < I and for any set W C V — (v U N(v)) with \W\ < d there exist 
values x v ,x Ul , . . . , x Ui , . . . , x Ul , x' u £ A and x\y £ A^ w \ such that 

\P(X(v) = x v \X(N(v))=x N(v) ) 

- P(X(v) = x v \X(N(v)) = x* N{v) (x'J)\ >e 

and 

\P(X(N(v)) = x N[v) ,X(W) =x w )\> 5, 
P(X(N(v))=x l N(v) ,X(W) = x w ) 



(17) 
> S. 



Then for some constant C = C(e,5) > 0, if k > Cdlogn then there exists an estimator 
G(X) such that the probability of correct reconstruction is P{G = G(X)) = 1 — o(l). The 
estimator G is computable in time 0(n 2d+l logn). 
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Proof. As in Theorem [2] we can assume that with high probability we have 



P(X(U) = x v ) - P(X(U) = xu ) < 7 (18) 

for all {ui}[ =1 and {xi}[ =1 when I < 2d + \ and k > C{^)d\ogn so we assume that (fTHj) 
holds. For each vertex v G V we consider all candidate neighborhoods for v, subsets 
U = {«!,...,«/} C V — {v} with < I < d. For each candidate neighborhood U, the 
algorithm computes a score 

f(v;U) = 

min max |P(X(w) = x v \X(W) = xw,X(U) = xjj) 

W,i x v ,x w ,xu,x' Ui 

- P(X(v) = x v \X(W) = x w ,X(U) = x\j{x' Ui ))\, 
where for each W,i, the maximum is taken over all x v ,X]y,xu,x' u . , such that 

P(X(W) = x w ,X(U) = x v ) > 5/2 (19) 
P(X(W) = x w ,X(U) = x\j(x' Ui )) > 5/2 

and W C V — ({v} U U) is an arbitrary set of nodes of size d, xw £ ■A d is an arbitrary 
assignment of values to the nodes in W, and 1 < i < I. 

The algorithm selects as the neighborhood of v the largest set U C V — {v} with 
f(v;U) > e/2. It is necessary to check that if U is the true neighborhood of v, then the 
algorithm accepts U, and otherwise the algorithm rejects U. 

Taking 7(e, 5) = e5 2 /9, it follows exactly as in Theorem [2] that the error in each of the 
relevant empirical conditional probabilities satisfies 

\P(X(v) = x v \X(W) = x w ,X(U) = xu) 

- P{X{v) = x v \X(W) = x w ,X(U) = xu )\<^. (20) 

If U N(v), choosing m G U - N(v), we have when N(v) C W U U 

\P(X(v) = x v \X(W) = x w ,X(U) = xu) - P(X(v) = x v \X(W) = x w ,X(U) = 
= \P(X(v) = x v \X(N(v)) = x N(v) ) - P(X(v) = x v \X(N(v)) = x N(v) )\ 
= 0, 

by the Markov property (J7|). Assuming that equation (|18p holds with 7 chosen as in f)12|) . the 
estimation error in f(v; U) is at most e/2 by equation (j20l) . and it holds that f(v; U) < e/2 
for each U ^ N(v). Thus all U ^ N(v) are rejected. If U = N(v), then by the Markov 
property ([7]) and the conditions (fTHl) and (fT7|l . for any i and W C V, 

\P(X(v) = x v \X(W) = x w ,X(U) = xu) - P(X(v) = x v \X(W) = x w ,X(U) = x\j(x' u J)\ 
= \P(X(v) = x v \X(N(v)) = x N(v) ) - P(X(v) = x v \X(N(v)) = x l N(v) (x' Ut ))\ 
> e 

for some x v , xw, xu, x' u .. The error in f(v ; U) is less than e/2 as before, hence f(v; U) > e/2 
for U = N(v). Since U = N(v) is the largest set that is not rejected, the algorithm correctly 
determines the neighborhood of v for every v G V when (I18p holds. 
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To summarize, with number of samples 

, /81(2d + l) _\ 

the algorithm correctly determines the graph G with probability 

P{G{X) = G) > 1 - 2A 2d+1 /n Cl . 
The analysis of the running time is similar to the previous algorithm. □ 

4.3 Non-degeneracy of Models 

We can expect conditions (|16p and ()17p to hold in essentially all models of interest. The 
following proposition shows that they hold for any model with soft constraints. 

Proposition 1 (Models with soft constraints). In a graphical model with maximum degree 
d given by equation ^) suppose that all the potentials *$> uv satisfy \\^ U v\\oo < K and 

max \^uv(xi,x 2 ) - ^uv(x3,x 2 ) - ^uvixi^i) + ^ uv (x 3 ,x 4: )\ > (21) 

Xl,X2,X3,X4eA 

for some 7 > 0. Then there exist e,S > depending only on d,K and 7 such that the 
hypothesis of Theorem [3] holds. 

Proof. It is clear that for some sufficiently small 5 = 5(d, m, K) > we have that for all 
Ui,..., u 2 d+i G V and x Ul , . . . , x U2d+1 £ A that 

P(X(ui ) = x Ul , . . . , X(u 2d +i) = x U2d+1 ) > 5. (22) 

Now suppose that u±, . . . ,ui is the neighborhood of v. Then for any 1 < i < I it follows from 
equation PTT) that there exists x v , x' v , x Ui ,x' u . G A such that for any x Ul . . . , x Ui _ t , x Ui+1 , . . . , x Ul E 
A, 



P{X{v) = 


= x v \X(m) = 


' 3 ■ 


..,X( Ui ) 


- x' 


,.,X(ui) ~- 


= x Ul 


P{X(v) = 


-- x' v \X(ui) = 


" •&U\ } • 


..,X( Ui ) 


= X 'uo ■ 


..,X(ui) -- 


= x Ul 




x v \X{ui) = 


X Ul ? . . 


.,X( Ui ): 


= x Ui , . . 


.,X( Ul ) = 


- x Ul ) 


~ P(X(v) = 


x'^Xiux) = 


X Ul , . . 


.,X( Ui ) : 


= Xv,i 1 ■ ■ 


.,X( Ul ) = 


- x Ul ) 


Combining with equation ( 


[22]). condition 


follows. 









□ 

Although the results to follow hold more generally, for ease of exposition we will keep 
in mind the example of the Ising model with no external magnetic field, 



P(x) 



exp J ^2 PuvX u x v J , (23) 



where (3 UV S R are coupling constants and Z is a normalizing constant. 

The following lemma gives explicit bounds on e, 5 in terms of bounds on the coupling 
constants in the Ising model, showing that the conditions of Theorem [3] can be expected to 
hold quite generally. 
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Proposition 2. Consider the Ising model with all parameters satisfying 

< c < |/%| < C 

on a graph G with max degree at most d . Then the conditions (|16p and (|17p of Theorem [3] 
are satisfied with 

tanh(2c) 



e > 



and 



2C 2 + 2(7- 2 



S > 



2 2d 



Proof. Fix a vertex v € V and let w S iV('u) be any vertex in the neighborhood of v. Let 
R = N(v) \ {w} be the other neighbors of v. Then 



P(X(v) = 1\X(R) = x R ,X(w) = x w ) 

= P{X{v) = 1,X(R) = x R ,X(w) = x w ) 

P(X(v) = 1, X(R) = x R , X(w) = x w ) + P(X(v) = 0, X(R) = x R , X(w) = x v 



(24) 



exp |^ ^2j£R •Ejfijv %wfiwvj ~i~ GXp |^ R ■Ejfijv X W (3 U 

Defining 

A := exp ^ XjPj v 

we have from (12411 that 



\P(X(v) = 1\X(R) = x R ,X(w) = 1) - P(X(v) = 1\X(R) = x R ,X(w) = -1) 
Ae@ wv Ae~P wv 



J^fjftwv _J_ lg ftwv y4.C $wv _|_ 1 gfiwv 

tut; | 



^4 _|_ j\2( e 2/3 wv _|_ g-2/3^) _|_ ^ 
^2/ e 2|^| _ e -2|^h 
^4 _|_ ^42 ^2\fi wv I _|_ g— 2|/3 U „_,|^ _|_ ^ 



-2|/9„ 



> 



tanh(2|/3„ 



42 + e 2|/3„„| + g-21/3^,1 + ^4-2 - 2^42 + 2,4-2 • 

It is possible to choose the spins xr in such a way that e~ c < A < e c . Thus the expression 
above is at least 

tanh(2c) 
2e 2C + 2e -2C • 

Moreover, the probability of any assignment of 2d spins can be very crudely bounded as 

a -idC 



P(X(ix) = x h , . . .,X(i 2 d) = x i2d ) > 



2 2d 



□ 
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4.4 0(n 2 logn) Algorithm For Models with Correlation Decay 

The reconstruction algorithm runs in polynomial time 0(dn 2d+1 Inn). It would be desirable 
for the degree of the polynomial to be independent of d and this can be achieved for Markov 
random fields with exponential decay of correlations. For two vertices u, v E V let d(u, v) 
denote the graph distance and let dc(u,v) denote the correlation between the spins at u 
and v defined as 

d c (u, v) = \P(X(u) = x u , X{v) = x v ) - P(X(u) = x u )P(X{v) =x v )\. 

eA 

If the interactions are sufficiently weak the graph will satisfy the Dobrushin-Shlosman con- 
dition (see e.g. [8]) and there will be exponential decay of correlations between vertices. 

Theorem 4. Suppose that G and X satisfy the hypothesis of Theorem [3] and that for 
all u,v G V, dc(u,v) < exp(—ad(u,v)) and there exists some k > such that for all 
(u,v) G E, dc(u,v) > k. Then for some constant C = C(a,K,e,S) > 0, if k > Cdlogn 
then there exists an estimator G(X) such that the probability of correct reconstruction is 

P(G = G(X)) = 1 — o(l) and the algorithm runtime is 0(nd « + dn Inn) with high 
probability. 

Proof. Denote the correlation neighborhood of a vertex v as Nc(v) = {u £ V : dc(u,v) > 
k/2} where dc(u,v) is the empirical correlation of u and v. For large enough C with high 
probability for all v £ V we have that N(v) C Nc(v) C {u € V : d(u,v) < hl W K ^ |, Now 
the size of \{u G V : d(u,v) < ln (V K ) 1 1 < d < Q / which is independent of n. 

When reconstructing the neighborhood of a vertex v we modify the algorithm in Theo- 
rem [3] to only test candidate neighborhoods U and sets W which are subsets of Nc(v). The 
algorithm restricted to the smaller range of possible neighborhoods correctly reconstructs 
the graph since the true neighborhood of a vertex is always in its correlation neighborhood. 
For each vertex v the total number of choices of candidate neighborhoods U and sets W 

dln(4/«) 

the algorithm has to check is 0{d °> ) so running the reconstruction algorithm takes 

dln(4/«) „ 

0(nd a ) operations. It takes O (dn Inn) operations to calculate all the correlations 
which for large n dominates the run time. □ 

5 Noisy and Incomplete Observations 

More generally there is the problem of reconstructing a Markov random field from noisy 
observations. In this setting we observe Y_ = {Y 1 , . . . , Y k } instead of X — {X 1 , . . . , X k } 
where each Yj is a noisy version of X%. The algorithm in Theorem [3] is robust to small 
amounts of noise, even when the errors in different vertices are not necessarily independent. 
One sufficient condition is that there exist < e' < e and < 5' < 5 such that for any 
2d + 1 vertices v\, . . . , «2d+i and states x±, . . . , x 2 d+i we have that 

\P(X(vi) = xi, . . . , X(v 2d ) = x 2d ) - P(Y( Vl ) = xi, . . . , Y(v 2d ) = x 2d )\ < S'/2 

and 

\P(X(v 2d+ i) = x 2d+ i\X(vi) = xi, . . . ,X(v 2d ) = x 2d ) 

-P{Y{v 2d+l ) = x 2d+1 \Y{v 1 ) = x 1 ,...,Y{v 2d ) = x 2d )\ < e'/2. 

For some C' = C'(e, e' , 5, 5') > with k = Cdlogn samples the reconstruction algorithm of 
Theorem [3] correctly reconstructs the graph G with high probability (the same proof holds). 
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5.1 An Example of Non-Ident inability 

Without assumptions on the underlying model or noise, the Markov random field is not in 
general identifiable. In other words, a single probability distribution might correspond to 
two different graph structures. Thus, the problem of reconstruction is not well-defined in 
such a case. The next example shows that even in the Ising model, under unknown noise it 
is impossible to distinguish between a graph with 3 vertices and 2 edges and a graph with 
3 vertices and 3 edges. 

Example 1. Let V = {vi,V2,vs} be a set of 3 vertices and let G and G be two graphs with 
vertex set V and edge sets {(ui,u 2 ), (ui,us)} and {(ui,U2), (1*1,^3), (112, U3)} respectively. 
Let P and P be Ising models on G and G with edge interactions /3i2,/?i3 and P\2,Piz,lh.3 
respectively, i.e. 

P[X] = ^exp{p 12 X(u 1 )X(u2) + p 13 X(u 1 )X(u 3 )) 

P[X] = i exp (p 12 X{ Ul )X(u2) + A3X( Ul )X(u 3 ) + fozX{u2)X{uz)) . 

Suppose that X'{u\), a noisy version of the spin X{u\), is observed which is equal to X{u\) 
with probability p and —X(u\) with probability 1 — p for some random unknown p while the 
spins X(u2) and X{u%) are observed perfectly. This is equivalent to adding a new vertex 
u[ to G and G with an extra edge (ui,u[) and potential VP( uljtt ') = /3iiiX(ui)X(ui). The 
spin at u[ then represents the noisy observation of the spin at u\. Suppose that all the j3 
and f3 are chosen independently with N(0, 1) distribution and let V and V be the random 
noisy distributions on J\i u f u ^ u ^) . Then the total variation distance between V and V is 
less than 1 and so the graph structure is not identifiable as we shall show below. 

By the symmetry of the Ising model with no external field the random element V can 
be parameterized by (pi'2,Pi'3,P23) G [0, l] 3 where p V2 = PiX,^ = 1,X U2 = l),pi>3 = 
P(X U > = \,X Uz = 1),P23 = P{X U2 = l;^w 3 = 1)- These parameters are given by 

Pij = fcO3ii)M0y) + h(-pn)h(-pij) 

where h(/3) = e $+ e -p ■ Let ip be the function ip : M 3 — > [0, l] 3 which maps (Pw , P12, P13) ^ 
(Pi'2>Pi'3)P23) and let J v be its Jacobian. Then det(J^(l, 1, 1)) > and by continuity 
the Jacobian is positive in a neighborhood of (1,1,1). It follows that the random vec- 
tor (pi'2)Pi'3)2?23) has a density with respect to Lebesgue measure in a neighborhood of 

{2h{l)\2h{l)\2h{lf). 

Now let ip be the function tp : M 3 — > [0, l] 3 which maps (Pw , P12, P13, P23) {Pi'2,Pi'3,P23)- 
If we fix (3 2 3 = then ip = ip induces a positive density in the random vector (p^,pv3,P23) 
in a neighborhood of (2/i(l) 2 , 2/i(l) 2 , 2/i(l) 2 ). By continuity this also holds when | ^23 1 is 
small enough and so (pv2,pv3,P23) has a positive density around (2h(l) 2 , 2h(l) 2 , 2/i(l) 2 ). 
Hence we have that both V and have positive densities in an overlapping region so their 
total variation distance is less than 1 and so the graph structure is not identifiable. 

5.2 Models With Hidden Variables 

A related question is can we identify if a vertex is missing and if so where it fits into the 
graph. Under the assumption that the vertices all have degree at least 3 and the graph is 
triangle-free we can recover missing vertices under mild assumptions. 
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Theorem 5. Suppose that the hypothesis of Theorem^ holds for some Markov random field 
X based on a triangle-free graph with minimum degree at least 3 and maximum degree d' . 
Let rcy such that for any two points v, v' E V — V* we have d(v,v') > 3 and suppose 
we are given samples from X* , the restriction of X to V* with which to reconstruct G. 

Suppose the following condition also holds: for all v E V if V\,v 2 E N(v) and U = 
N(v) U N(vi) - {v,vi,v 2 } andWcV- (N(v) U N(vi)) with \ W\ < 2d then there exists 
some x Vl ,x V2 ,x' V2 ,xu,x\v such that 



and 



\P(X( Vl ) = x n \X(W) = x w ,X{U) = xu,X(v 2 ) = x V2 ) 

- P{X(y x ) = x Vl \X(W) = x w ,X(U) = xu,X(v 2 ) = x'J\ > e 

\P(X(W) = x w ,X(U) = Xu ,X{v 2 ) = x V2 )\ > 5, 
\P(X(W) = x w ,X(U) = x v ,X(v 2 ) = x' V2 )\ > S. 



(25) 



(26) 



Then for some constant C = C(e,5) > 0, if k > Cdlogn then there exists an estimator 
G(X*) such that the probability of correct reconstruction is P(G = G(X*)) = 1 — o(l). 

Proof. We apply the algorithm from Theorem [3] to X* setting the maximum degree as 
d = 2d'. The algorithm will output the graph G* = (V*,E*). If v,N(v) C V* then the 
algorithm correctly reconstructs the neighborhood N(v). Any vertex in V* is adjacent to at 
most one missing vertex so suppose that v \ is a vertex adjacent to a missing vertex v. Then 
by condition (|25p and (|26p we have that the algorithm reconstructs the neighborhood of v\ 
as N(v)UN(vi) — {v, vi}. So the edge set E* is exactly all the edges in the induced subgraph 
of V* plus a clique connecting all the neighbors of missing vertices. Since G is triangle-free 
every maximal clique (a clique that cannot be enlarged) of size at least 3 corresponds to a 
missing vertex. 

So to reconstruct G from G* we simply replace every maximal clique in G* with a vertex 
connected to all the vertices in the clique. This exactly reconstructs the graph with high 
probability. 

□ 

Remark 3. The condition that missing vertices are at distance at least 3 is not necessary, 
but this assumption simplifies the algorithm because the cliques corresponding to missing 
vertices are disjoint. A slightly more involved algorithm is able to reconstruct graphs where 
the missing vertices have d(v,v') = 2. 

The following lemma shows that the conditions for recovery of missing vertices in Theo- 
rem [5] are satisfied for a ferromagnetic Ising model satisfying the assumptions of Lemma [2j 

Lemma 2. Consider the ferromagnetic Ising model where all coupling parameters satisfy 

< c < fa < C 

on a triangle-free graph G with minimum degree 3. Then the conditions of Theorem are 
satisfied with 

tanh(2c) 
e - 32e 2(d+i)C( C 2 + C -2)' 

and 

p -4dC 
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Proof. To check the first condition we write 

\P(X(vi) = 1\X(N) = x N ,X(v 2 ) = 1) 

- P(X( Vl ) = 1\X(N) = x N ,X(v 2 ) = -1)| 

= \P(X( Vl ) = 1\X(N) = x N ,X(v 2 ) = l,v = l)P(v = 1\X(N) = x N ,X(v 2 ) = 1) 
+ P(X( Vl ) = 1\X(N) = x N ,X(v 2 ) = l,v = -l)P(v = -1\X(N) = x N ,X(v 2 ) = 1) 

- P(X( Vl ) = 1\X(N) = x N ,X(v 2 ) = -l,v = l)P(v = 1\X(N) = x N ,X(v 2 ) = -1) 

- P(X( Vl ) = 1\X(N) = x N ,X(v 2 ) = -l,v = -l)P(v = -1\X(N) = x N ,X(v 2 ) = -1)| 
= \P(X(v t ) = l\X(N) =x N ,v = l)P(v = 1\X(N) = x N ,X(v 2 ) = 1) 

+ P(X( Vl ) = 1\X(N) = x N ,v = -l)P(v = -1\X(N) = x N ,X(v 2 ) = 1) 

- P(X( Vl ) = 1\X(N) = x N ,v = l)P(v = 1\X(N) = x N ,X(v 2 ) = -1) 

- P(X( Vl ) = 1\X(N) =x N ,v = -l)P(v = -1\X(N) = x N ,X(v 2 ) = -1)| 

where N = N(v)L)N(vi) — {v,vi,v 2 } and where the last step follows by the Markov property 
(since all paths from v\ to v 2 pass through vertices in TV" or through v). Continuing, we 
have that the above is equal to 

|(P(fi = l|JV,t; = 1) - P(vt = l\N,v = -1)) (P(v = l\N,v 2 = 1) - P(v = l\N,v 2 = -1))| . 

(27) 

But by Lemma [2 

| (P( V1 = l\N,v = l)- P(v = l\N,v = -1)) | > 2 g n | l( 2 2 g_ 2 • 
By the ferromagnetic assumption, the second factor can be lower bounded as 

| (P(v = l\N,v 2 = 1) - P(v = l\N,v 2 = -1)) | > w J d+1)c . 
Hence the first condition is satisfied with 

tanh(2c) 



e > 



32e 2(rf+l)C (Q2 + C -2) • 

The second condition, by the same argument as Lemma [21 is satisfied with 

p -4dC 

8> 



2 2d 



□ 



Acknowledgment E.M. thanks Marek Biskup for helpful discussions on models with 
hidden variables. 
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